Linear Regression

In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression.

https://en.wikipedia.org/wiki/Linear_regression


In [ ]:
# Linear Regression
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LinearRegression

In [ ]:
# load the diabetes dataset
# `datasets` is a sklearn utility module which loads datasets for you
# existing loaders include load_boston (Boston house prices; removed in recent scikit-learn versions)
# or load_iris (the iris flower dataset)
# Here, we load the diabetes dataset
# To access the feature matrix, call dataset.data => returns X, a 442x10 array
# To access the target, call dataset.target => returns Y, an array with 442 entries
# Basically, we have 442 samples, each sample being described by 10 variables (= features)
dataset = datasets.load_diabetes()
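
A quick sanity check that the shapes match the description above:

In [ ]:
# X should be 442x10 and Y should have 442 entries
print(dataset.data.shape)    # (442, 10)
print(dataset.target.shape)  # (442,)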

In [ ]:
# Let us now build a pandas DataFrame hosting the data at hand

# We first need the list of feature names for our columns
# BMI is the Body Mass Index
# ABP is the Average Blood Pressure
lfeat = ["Age", "Sex", "BMI", "ABP", "S1", "S2", "S3", "S4", "S5", "S6"]

In [ ]:
# We now build the Dataframe, with the data as argument
# and the list of column names as keyword argument
df_diabetes = pd.DataFrame(dataset.data, columns=lfeat)

In [ ]:
# Let's have a look at the first few entries
print "Printing data up to the 5th sample"
df_diabetes.iloc[:5,:] # Look at the first 5 samples for all features.

In [ ]:
# We also want to add the regression target
# Let's create a new column:
df_diabetes["Target"] = dataset.target # Must have the correct size of course

In [ ]:
# Let's review our complete DataFrame:
print()
print("Printing data up to the 5th sample")
print("Also print the target")
df_diabetes.iloc[:5,:] # Look at the first 5 samples for all features including target

In [ ]:
# We will fit a linear regression model to the data.
# This is the most basic supervised learning method
# Let us first introduce some notation
# We write x = (x1, x2, ..., x10), a row vector of X
# We model the response y as a linear combination of the components of x:
# y = b0 + b1x1 + ... + b10x10 (b0 is called the intercept)
# We now want to find the vector b = (b0, b1, .. b10) which gives the best representation of the data.
# Best is generally understood with respect to a loss criterion.
# The most popular is called "least squares".
# We try to minimize the squared error i.e. (y - (b0 + b1x1 + ... + b10x10))**2
# With several samples (more than one row vector of X),
# We minimize the sum of the squared errors.
# In vector notation, this means minimizing ||(Y-Xb)||**2 (1)
# where we have added a column of ones to X to account for the intercept
# The value of b which minimizes (1) can be found analytically or with gradient descent.
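
For the curious, here is a minimal numpy sketch of the analytic solution to (1): we prepend a column of ones for the intercept and let np.linalg.lstsq solve the least-squares problem directly (more stable than inverting X^T X by hand).

In [ ]:
# Analytic least squares: find the b minimizing ||Y - Xb||**2
X = np.hstack([np.ones((dataset.data.shape[0], 1)), dataset.data])  # intercept column + features
Y = dataset.target
b, residuals, rank, sv = np.linalg.lstsq(X, Y, rcond=None)
print("Intercept b0:", b[0])
print("Coefficients b1..b10:", b[1:])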

In [ ]:
# fit a linear regression model to the data
# This creates an instance of the LinearRegression model class
model = LinearRegression()

In [ ]:
# The model.fit command will automatically find the coefficient vector b we talked about earlier
# We only have to specify the features X and the target Y
# With pandas, we have to split the DataFrame in two:
# the feature part (X) and the target part (Y)
# This is done below:

data = df_diabetes[lfeat].values
target = df_diabetes["Target"].values

model.fit(data, target)
print(model)
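
Once fitted, the coefficients are exposed on the model object: model.intercept_ corresponds to b0 and model.coef_ to b1..b10.

In [ ]:
# Inspect the fitted coefficients
print("Intercept b0:", model.intercept_)
print("Coefficients b1..b10:", model.coef_)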

In [ ]:
# Once the model has been fitted, it can be used to make new predictions
# (a new prediction is simply computed as y = b0 + b1x1 + ... + b10x10,
#  where b was found in the fitting step and x is any feature vector/matrix)
# To this end, we simply call model.predict(x)
expected = target
predicted = model.predict(data)
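
As a sanity check (a minimal sketch, sample 0 chosen arbitrarily), we can reproduce one prediction by hand from the fitted coefficients:

In [ ]:
# y = b0 + b1*x1 + ... + b10*x10, computed manually for the first sample
x0 = data[0]
manual = model.intercept_ + np.dot(model.coef_, x0)
print("model.predict :", model.predict(data[:1])[0])
print("manual formula:", manual)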

In [ ]:
# summarize the fit of the model
# We can estimate the performance of the fit using the MSE (mean squared error) metric
# In this case, we simply compute the mean of the squared errors on the sample
# the lower, the better
mse = np.mean((predicted-expected)**2)
print("Mean squared error: %.2f" % mse)
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % model.score(data, target))
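
The same numbers are available through scikit-learn's metrics module; a minimal equivalent sketch:

In [ ]:
# Equivalent computation with sklearn's ready-made metrics
from sklearn.metrics import mean_squared_error, r2_score
print("MSE: %.2f" % mean_squared_error(expected, predicted))
print("R^2: %.2f" % r2_score(expected, predicted))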

Plotting results

Here we have used 10 features to create a model, which cannot be plotted in 2D. Below is an example from the scikit-learn website using just one feature and plotting the results. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#example-linear-model-plot-ols-py


In [ ]:
%matplotlib inline

# Code source: Jaques Grobler
# License: BSD 3 clause

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()


# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
         linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

Business formulation

Our dataset is the diabetes dataset. See diabetes_dataset_desc.txt for a complete description.

  • How does a doctor go about diagnosing diabetes?
  • We assume the doctor has already treated many patients
  • So, given their examination of a new patient,
  • they use their experience to find out whether the patient has diabetes or not.

How can we do the same mathematically?

  • We have gathered data, i.e. measurements on the patient + the current state of the disease
  • We will see that this data can be used to give us expertise in diagnosing diabetes

Perhaps the most intuitive approach is to look for linear relations. By linear, we mean direct proportionality. For instance, you may want to predict how many people in the classroom get sunburnt given the average temperature. You may observe that at an average of 30 degrees, 10 people get sunburnt. So you extrapolate, saying that at 40 degrees, 40*10/30 ≈ 13.3 people will get sunburnt.
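
In code, this rule of three is a one-liner (toy numbers taken from the example above):

In [ ]:
# Direct proportionality: extrapolate sunburns linearly with temperature
temp_obs, sunburnt_obs = 30.0, 10.0
temp_new = 40.0
print(temp_new * sunburnt_obs / temp_obs)  # ~13.3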

Building on that, you want to predict the status of the patient given an observation. For instance, you may think that a patient is more likely to be ill the higher their weight is. So you may go about modelling this as illness = weight * (positive coefficient). That is, the higher the weight, the more dramatic the illness. And you can find the coefficient from existing data.

Going beyond that, you may think that there are several causes for the disease

It could be that the older you get, the likelier you are to be sick, so illness = age * (positive coefficient)

The fundamental idea of multivariate regression is that you may find a better prediction of the disease if you incorporate several causes. So you model illness as illness = age * c1 + weight * c2 + other_cause * c3 + ...

And then you use the data to find out the value of the coefficients
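
A minimal sketch of that multivariate idea on made-up data (the coefficients 0.5 and 0.3 and the noise level are purely illustrative):

In [ ]:
# Toy multivariate regression: illness = c0 + c1*age + c2*weight + noise
rng = np.random.RandomState(0)
age = rng.uniform(20, 80, size=100)
weight = rng.uniform(50, 120, size=100)
illness = 0.5 * age + 0.3 * weight + rng.normal(0, 5, size=100)

X_toy = np.column_stack([age, weight])
toy = LinearRegression().fit(X_toy, illness)
print("Recovered c1 (age), c2 (weight):", toy.coef_)
print("Recovered intercept c0:", toy.intercept_)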

